Image Captioning

Shan-Hung Wu & DataLab
Fall 2022

In the last lab, you learned how to implement machine translation, where the task is to transform a sentence S written in a source language into its translation T in the target language.

The model architecture in machine translation is intuitive. An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which is in turn used as the initial hidden state of a “decoder” RNN that generates the target sentence.

So, what if the encoder looks at images instead of reading sentences? That is to say, we use a convolutional neural network to obtain a vectorial representation of an image, and a recurrent neural network to decode that representation into a natural language sentence. The description must capture not only the objects contained in the image, but also how these objects relate to each other, as well as their attributes and the activities they are involved in.

This is Image Captioning, a very important challenge for machine learning algorithms, as it amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language.

m-RNN

This paper presents a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images.

To the best of the authors' knowledge, this is the first work to incorporate a recurrent neural network into a deep multimodal architecture.

The whole m-RNN architecture contains 3 parts: a language model part, an image part and a multimodal part.

It must be emphasized that:

  1. The image part is AlexNet; the seventh layer of AlexNet is connected to the multimodal layer.
  2. This model feeds the image at each time step.

NIC

This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image. The model uses the encoder-decoder framework of machine translation, replacing the encoder RNN with a deep convolutional neural network.

It must be emphasized that:

  1. The model uses a more powerful CNN in the encoder, which yields the best performance on the ILSVRC 2014 classification competition at that time.
  2. To deal with vanishing and exploding gradients, LSTM was introduced in the decoder to generate sentences based on the fixed-length vector representations from CNN.
  3. The image is only input once, at t = -1, to inform the LSTM about the image contents.
    We empirically verified that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfits more easily.
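
The image-once decoding scheme above can be illustrated with a toy NumPy sketch. This is not the paper's code: `lstm_step` merely stands in for a real LSTM cell, and all names are illustrative.

```python
import numpy as np

def lstm_step(x, h):
    # stand-in for a real LSTM cell: mixes the input with the hidden state
    return np.tanh(x + h)

def nic_decode(image_embedding, word_embeddings):
    """NIC-style decoding: the image is fed only once, at t = -1."""
    h = np.zeros_like(image_embedding)
    h = lstm_step(image_embedding, h)   # t = -1: the image is the first input
    outputs = []
    for w in word_embeddings:           # t = 0, 1, ...: word inputs only
        h = lstm_step(w, h)
        outputs.append(h)
    return outputs

# toy usage: a 4-d "image embedding" followed by three "word embeddings"
outs = nic_decode(np.ones(4), [np.zeros(4) for _ in range(3)])
assert len(outs) == 3
```

Note that the image never re-enters the loop; contrast this with m-RNN above, which feeds the image at every time step.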

Attention-Based

One of the most curious facets of the human visual system is the presence of attention. Rather than compress an entire image into a static representation, attention allows for salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image.

Using representations (such as those from the top layer of a convnet) that distill the information in an image down to the most salient objects is one effective solution that has been widely adopted in previous work. Unfortunately, this has the potential drawback of losing information which could be useful for richer, more descriptive captions.

Using lower-level representations can help preserve this information. However, working with these features necessitates a powerful mechanism to steer the model to the information important to the task at hand.

This paper describes approaches to caption generation that attempt to incorporate a form of attention with two variants:

  1. a “soft” deterministic attention mechanism trainable by standard back-propagation methods
  2. a “hard” stochastic attention mechanism trainable by maximizing an approximate variational lower bound or equivalently by REINFORCE
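
The “soft” variant is simply a learned weighted average over feature locations. Here is a minimal NumPy sketch of additive (Bahdanau-style) soft attention; the parameter names `W1`, `W2`, `v` are illustrative, not taken from the paper's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, hidden, W1, W2, v):
    """Deterministic 'soft' attention: a weighted average of image features.

    features: (L, D) annotation vectors (e.g. 64 locations x 2048 channels)
    hidden:   (H,)   current decoder state
    """
    # additive score for each of the L locations
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (L,)
    alpha = softmax(scores)                             # attention weights, sum to 1
    context = alpha @ features                          # (D,) expected feature vector
    return context, alpha

# toy usage with random parameters
rng = np.random.default_rng(0)
L, D, H, U = 8, 16, 12, 10
feats, h = rng.normal(size=(L, D)), rng.normal(size=H)
W1, W2, v = rng.normal(size=(D, U)), rng.normal(size=(H, U)), rng.normal(size=U)
context, alpha = soft_attention(feats, h, W1, W2, v)
assert np.isclose(alpha.sum(), 1.0) and context.shape == (D,)
```

Because every step is differentiable, this variant trains with ordinary back-propagation; the “hard” variant instead samples one location from `alpha` and needs REINFORCE or a variational bound.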

The paper above, which introduced the attention-based model, was published at ICML 2015. Since then, many models have been developed by researchers, and the state of the art has been surpassed again and again.

If you are interested in this field, you can also click on this link, which summarizes many excellent papers in image captioning.

Now, let's begin with our implementation.

Image Captioning

Given an image like the example below, our goal is to generate a caption such as "a surfer riding on a wave".

Man Surfing

Image Source; License: Public Domain

To accomplish this, you'll use an attention-based model, which enables us to see what parts of the image the model focuses on as it generates a caption.

Prediction

The model architecture is similar to Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.

This notebook is an end-to-end example. When you run the notebook, it downloads the MS-COCO dataset, preprocesses and caches a subset of images using Inception V3, trains an encoder-decoder model, and generates captions on new images using the trained model.

Download and prepare the MS-COCO dataset

You will use the MS-COCO dataset to train your model. The dataset contains over 82,000 images, each of which has at least 5 different caption annotations. The code below downloads and extracts the dataset automatically.

Caution: large download ahead. You'll use the training set, which is a 13GB file.
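
A hedged sketch of the download step, using `tf.keras.utils.get_file` with the official MS-COCO 2014 release URLs; the helper name is illustrative. The function is defined but not called here, since running it fetches about 13 GB.

```python
import os

def download_mscoco(cache_dir='.'):
    """Download and extract the MS-COCO 2014 training images and captions.

    Warning: the image archive alone is about 13 GB.
    """
    import tensorflow as tf  # imported here so the helper stays importable without TF

    annotation_zip = tf.keras.utils.get_file(
        'captions.zip',
        cache_subdir=os.path.abspath(cache_dir),
        origin='http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
        extract=True)
    annotation_file = os.path.dirname(annotation_zip) + '/annotations/captions_train2014.json'

    image_zip = tf.keras.utils.get_file(
        'train2014.zip',
        cache_subdir=os.path.abspath(cache_dir),
        origin='http://images.cocodataset.org/zips/train2014.zip',
        extract=True)
    image_dir = os.path.dirname(image_zip) + '/train2014/'
    return annotation_file, image_dir
```

Call `download_mscoco()` once in the notebook; `get_file` caches the archives, so re-running it is cheap.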

Optional: limit the size of the training set

To speed up training for this tutorial, you'll use a subset of 30,000 captions and their corresponding images to train your model. Using more data would result in improved captioning quality.

There are 82,783 images with 414,113 captions in total, but we only use 30,000 captions and their corresponding images to train our model.

Preprocess the images using InceptionV3

Next, you will use InceptionV3 (which is pretrained on Imagenet) to classify each image. You will extract features from the last convolutional layer.

First, you will convert the images into InceptionV3's expected format by:

  1. Resizing the image to 299px by 299px
  2. Preprocessing the image using the preprocess_input method to normalize it so that it contains pixels in the range of -1 to 1, which matches the format of the images used to train InceptionV3
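
A minimal sketch of this conversion, following the standard InceptionV3 preprocessing (299x299 resize and `preprocess_input` scaling to [-1, 1]); returning the path alongside the image makes caching easier later.

```python
import tensorflow as tf

def load_image(image_path):
    """Read, decode, resize to 299x299, and scale pixels to [-1, 1]."""
    img = tf.io.read_file(image_path)
    img = tf.io.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
```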

Initialize InceptionV3 and load the pretrained Imagenet weights

Now you'll create a tf.keras model where the output layer is the last convolutional layer in the InceptionV3 architecture. The shape of the output of this layer is 8x8x2048. You use the last convolutional layer because you are using attention in this example. You don't perform this initialization during training because it could become a bottleneck.
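
This step can be sketched as follows; wrapping it in a function (an illustrative convenience, not required) lets you swap `weights=None` in for quick tests instead of downloading the ImageNet weights.

```python
import tensorflow as tf

def build_feature_extractor(weights='imagenet'):
    """InceptionV3 truncated at its last convolutional layer.

    The output for a 299x299 input has shape (batch, 8, 8, 2048).
    """
    base = tf.keras.applications.InceptionV3(include_top=False, weights=weights)
    return tf.keras.Model(base.input, base.layers[-1].output)
```

Usage: `image_features_extract_model = build_feature_extractor()`.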

Caching the features extracted from InceptionV3

You will pre-process each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but also memory intensive, requiring 8 * 8 * 2048 floats (about 0.5 MB) per image. At the time of writing, this exceeds the memory limitations of Colab (currently 12GB of memory).

Performance could be improved with a more sophisticated caching strategy (for example, by sharding the images to reduce random access disk I/O), but that would require more code.
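
A hedged sketch of the caching loop. `load_fn` is any function mapping a path to an `(image, path)` pair, and `extractor` is the truncated InceptionV3 model from the previous step; the helper name and signature are illustrative. Each (8, 8, 2048) feature map is flattened to (64, 2048) — one row per spatial location, as the attention mechanism expects — and saved as `<image_path>.npy`.

```python
import numpy as np
import tensorflow as tf

def cache_features(image_paths, extractor, load_fn, batch_size=16):
    """Run each unique image through the CNN once and save its feature map."""
    ds = (tf.data.Dataset.from_tensor_slices(sorted(set(image_paths)))
            .map(load_fn, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(batch_size))
    for imgs, paths in ds:
        feats = extractor(imgs)                                          # (B, 8, 8, 2048)
        feats = tf.reshape(feats, (feats.shape[0], -1, feats.shape[3]))  # (B, 64, 2048)
        for f, p in zip(feats, paths):
            np.save(p.numpy().decode('utf-8'), f.numpy())
```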

The caching will take about 10 minutes to run in Colab with a GPU. If you'd like to see a progress bar, you can:

  1. install tqdm:

    !pip install -q tqdm

  2. Import tqdm:

    from tqdm import tqdm

  3. Change the following line:

    for img, path in image_dataset:

    to:

    for img, path in tqdm(image_dataset):

Preprocess and tokenize the captions
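
A hedged sketch of this step, following the common Keras recipe: keep the most frequent words, map the rest to `<unk>`, and pad every caption to the same length. Captions are assumed to already be wrapped as `<start> ... <end>`; the function name and the 5,000-word vocabulary size are illustrative choices.

```python
import tensorflow as tf

def tokenize_captions(captions, vocab_size=5000):
    """Map captions to padded integer sequences; index 0 is reserved for <pad>."""
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
        num_words=vocab_size, oov_token='<unk>',
        # default filters minus '<' and '>' so the <start>/<end> markers survive
        filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
    tokenizer.fit_on_texts(captions)
    tokenizer.word_index['<pad>'] = 0
    tokenizer.index_word[0] = '<pad>'
    seqs = tokenizer.texts_to_sequences(captions)
    padded = tf.keras.preprocessing.sequence.pad_sequences(seqs, padding='post')
    return padded, tokenizer
```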

Split the data into training and testing

Create a tf.data dataset for training

Our images and captions are ready! Next, let's create a tf.data dataset to use for training our model.
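
A hedged sketch of the dataset construction, assuming the (64, 2048) feature maps were cached to disk as `<image_path>.npy` earlier; the helper name and batch sizes are illustrative. `tf.numpy_function` lets us call `np.load` inside the input pipeline.

```python
import numpy as np
import tensorflow as tf

def make_dataset(img_paths, captions, batch_size=64, buffer_size=1000):
    """Pair each cached feature file with its padded caption, then shuffle/batch."""
    def load_cached(img_path, cap):
        feats = np.load(img_path.decode('utf-8') + '.npy')
        return feats.astype(np.float32), cap

    ds = tf.data.Dataset.from_tensor_slices((img_paths, captions))
    ds = ds.map(
        lambda p, c: tf.numpy_function(load_cached, [p, c], [tf.float32, c.dtype]),
        num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)
```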

Model

Fun fact: the decoder below is identical to the one in the example for Neural Machine Translation with Attention.

The model architecture is inspired by the Show, Attend and Tell paper.
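
The heart of the decoder is the soft-attention module, which scores each of the 64 spatial locations against the current decoder state. A hedged sketch in Keras (layer names are illustrative):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    """Additive soft attention over the cached CNN feature locations."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, 64, depth) cached CNN outputs
        # hidden:   (batch, units)     current decoder state
        hidden_with_time = tf.expand_dims(hidden, 1)          # (batch, 1, units)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(scores, axis=1)     # (batch, 64, 1)
        context = tf.reduce_sum(attention_weights * features, axis=1)
        return context, attention_weights
```

At each decoding step the decoder concatenates `context` with the current word embedding before its RNN cell, and the returned `attention_weights` are what you later overlay on the image to visualize where the model "looked".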

Checkpoint

Training

Caption!

Try it on your own images

For fun, below we've provided a method you can use to caption your own images with the model we've just trained. Keep in mind that it was trained on a relatively small amount of data, and your images may differ from the training data (so be prepared for weird results!).

Assignment

CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart") is a type of challenge–response test used in computing to determine whether or not the user is human. It is a popular tool since it prevents spam attacks and protects websites from bots.

Some CAPTCHAs ask you to complete a math equation such as addition, subtraction, or multiplication. The equation can be shown using numbers, letters, or images, so bots have little chance of solving it.

reCAPTCHA is a CAPTCHA-like system designed to establish that a computer user is human (normally in order to protect websites from bots) and, at the same time, assist in the digitization of books or improve machine learning (even its slogan was “Stop Spam, Read Books”).

There are two versions. The first asks you to enter some words or digits from an image, and the second asks you to mark the checkbox “I’m not a robot”.

So, as you can see, both tools provide the same functionality and increase the security level of your website, but they look different. Furthermore, they also differ in their features. Let’s compare them:

CAPTCHAs of English words contain information about the words, so we can use a CNN to extract that information from the images. As you know, an RNN can capture the dependencies within a sequence, so it is well suited to generating the words based on the extracted information.

In this assignment, you have to train a captcha-recognizer which can identify English words in images. You can download this dataset here.

Description of Dataset:

  1. There are 140,000 images in this dataset, containing 2,500 English words (3–5 characters long) in four different fonts.
  2. All of these images contain some noise (lines, curves, and points) in different colors.

The captcha is shown in the figure below:

Requirements

  1. You can use any model architecture you want, as long as it accomplishes the goal.
  2. You should design your own model architecture. In other words, do not load the model or any pre-trained weights directly from other sources.
  3. You should use the first 100,000 images as training data, the next 20,000 as validation data, and the rest (final 20,000) as testing data.
    • spec_train_val.txt contains the labels of only the first 120,000 images.
  4. A prediction counts as correct only if the whole word matches exactly.
  5. You need to predict the answers for the testing data and write them to a file.
  6. Your testing accuracy should be at least 90%.
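
The required split can be sketched in a few lines, assuming the images are named in index order (e.g. 0.png ... 139999.png — an illustrative naming scheme, so adapt it to the actual filenames):

```python
# 100k train / 20k validation / 20k test, in file order as required
files = [f'{i}.png' for i in range(140000)]
train_files = files[:100000]
val_files = files[100000:120000]
test_files = files[120000:]
assert len(train_files) == 100000
assert len(val_files) == 20000
assert len(test_files) == 20000
```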

Notification: